Abstract



This chapter examines the core analytic elements of randomized experiments for social re- 
search. Its goal is to provide a compact discussion for faculty members, graduate students, and 
applied researchers of the design and analysis of randomized experiments for measuring the 
impacts of social or educational interventions. Design issues considered include choosing the 
size of a study sample and its allocation to experimental groups, using covariates or blocking to 
improve the precision of impact estimates, and randomizing intact groups instead of individuals. 
Analysis issues considered include estimating impacts when not all sample members comply 
with their assigned treatment and estimating impacts when groups are randomized. 
Introduction

This chapter introduces the central analytic principles of randomized experiments for 
social research. Randomized experiments are lotteries that randomly assign subjects to research 
groups, each of which is offered a different treatment. When the method is implemented prop- 
erly, differences in future outcomes for experimental groups provide unbiased estimates of dif- 
ferences in the impacts of the treatments offered. The method is usually attributed to Ronald A. 
Fisher (1925 and 1935), who developed it during the early 1900s. 1 After World War II, random- 
ized experiments gradually became the method of choice for testing new drugs and medical 
procedures, and to date over 350,000 randomized clinical trials have been conducted (Cochrane 
Collaboration, 2002). 2 

Numerous books have been written about randomized experiments as their applica- 
tion has expanded from agricultural and biological research (e.g., Fisher, 1935; Cochran and 
Cox, 1957; Kempthorne, 1952; and Cox, 1958) to research on industrial engineering (e.g., 
Box, Hunter, and Hunter, 2005), to educational and psychological research (e.g., Lindquist, 
1953, and Myers, 1972), to social science and social policy research (e.g., Boruch, 1997; Orr, 
1999; and Bloom, 2005a). In addition, several journals have been established to promote 
advancement of the method (e.g., the Journal of Experimental Criminology, Clinical Trials, and 
Controlled Clinical Trials). 

The use of randomized experiments for social research has greatly increased since the 
War on Poverty in the 1960s. The method has been used in laboratories and in field settings to 
randomize individual subjects, such as students, unemployed adults, patients, or welfare recipi- 
ents, and intact groups, such as schools, firms, hospitals, or neighborhoods. 3 Applications of the 
method to social research have examined issues such as child nutrition (Teruel and Davis, 2000); 
child abuse (Olds et al., 1997); juvenile delinquency (Lipsey, 1988); policing strategies (Sherman 
and Weisburd, 1995); child care (Bell et al., 2003); public education (Kemple and Snipes, 2000); 
housing assistance (Orr et al., 2003); health insurance (Newhouse, 1996); income maintenance 
(Munnell, 1987); neighborhood effects (Kling, Liebman, and Katz, forthcoming); job training 



1 References to randomizing subjects to compare treatment effects date back to the seventeenth century (Van 
Helmont, 1662), although the earliest documented use of the method was in the late nineteenth century for re- 
search on sensory perception (Peirce and Jastrow, 1884/1980). There is some evidence that randomized experi- 
ments were used for educational research in the early twentieth century (McCall, 1923). But it was not until Fisher 
(1925 and 1935) combined statistical methods with experimental design that the method we know today emerged. 

2 Marks (1997) provides an excellent history of this process. 

3 See Bloom (2005a) for an overview of group-randomized experiments; see Donner and Klar (2000) and 
Murray (1998) for textbooks on the method. 
(Bloom et al., 1997); unemployment insurance (Robins and Spiegelman, 2001); welfare-to-work 
(Bloom and Michalopoulos, 2001); and electricity pricing (Aigner, 1985). 4 

A successful randomized experiment requires clear specification of five elements. 

1. Research questions: What treatment or treatments are being tested? What is 
the counterfactual state (in the absence of treatment) with which treatments 
will be compared? What estimates of net impact (the impact of specific 
treatments versus no such treatments) are desired? What estimates of differ- 
ential impact (the difference between impacts of two or more treatments) are 
desired? 

2. Experimental design: What is the unit of randomization: individuals or 
groups? How many individuals or groups should be randomized? What por- 
tion of the sample should be randomized to each treatment or to a control 
group? How, if at all, should covariates, blocking, or matching (explained 
later) be used to improve the precision of impact estimates? 

3. Measurement methods: What outcomes are hypothesized to be affected by 
the treatments being tested, and how will these outcomes be measured? What 
baseline characteristics, if any, will serve as covariates, blocking factors, or 
matching factors, and how will these characteristics be measured? How will 
differences in treatments be measured? 

4. Implementation strategy: How will experimental sites and subjects be re- 
cruited, selected, and informed? How will they be randomized? How will 
treatments be delivered and how will their differences across experimental 
groups be maintained? What steps will be taken to ensure high-quality data? 

5. Statistical analysis: The analysis of treatment effects must reflect how ran- 
domization was conducted, how treatment was provided, and what baseline 
data were collected. Specifically it must account for: (1) whether randomiza- 
tion was conducted or treatment was delivered in groups or individually; (2) 
whether simple randomization was conducted or randomization occurred 
within blocks or matched pairs; and (3) whether baseline covariates were 
used to improve precision. 

This chapter examines the analytic core of randomized experiments — design and 
analysis, with a primary emphasis on design. 



4 For further examples, see Greenberg and Shroder, 1997. 
Why Randomize? 

There are two main reasons why randomized experiments are the most rigorous way to 
measure causal effects. 

They eliminate bias: Randomizing subjects to experimental groups eliminates all 
systematic preexisting group differences, because only chance determines which subjects are 
assigned to which groups. Consequently, each experimental group has the same expected values 
for all characteristics, observable or not. Randomization of a given sample may produce ex- 
perimental groups that differ by chance, however. These differences are random errors, not bi- 
ases. Hence, the absence of bias is a property of the process of randomization, not a feature of 
its application to a specific sample. The laws of probability ensure that the larger the experimen- 
tal sample, the smaller preexisting group differences are likely to be. 
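
The claim that chance differences between groups shrink as the sample grows can be illustrated with a small simulation. The sketch below is an illustration only: the function name, the standard-normal baseline characteristic, and the sample sizes are assumptions made for the example and do not come from any study cited here.

```python
import random
import statistics

def mean_abs_baseline_gap(n, reps=500, seed=0):
    """Average absolute treatment-control gap in one baseline characteristic.

    Hypothetical setup: the characteristic is drawn from a standard normal
    distribution and half of the n subjects are randomized to treatment.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(reps):
        baseline = [rng.gauss(0.0, 1.0) for _ in range(n)]
        rng.shuffle(baseline)  # the lottery: random order stands in for random assignment
        half = n // 2
        treatment, control = baseline[:half], baseline[half:]
        gaps.append(abs(statistics.fmean(treatment) - statistics.fmean(control)))
    return statistics.fmean(gaps)

if __name__ == "__main__":
    # The typical chance gap should fall as the sample size rises.
    for n in (20, 200, 2000):
        print(n, round(mean_abs_baseline_gap(n), 3))
```

Any single randomization can still produce a noticeable gap; what the simulation shows is that the typical size of the gap falls as the sample grows, which is the sense in which these differences are random errors rather than biases.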

They enable measurement of uncertainty: Experiments randomize all sources of un- 
certainty about impact estimates for a given sample (their internal validity). Hence, confidence 
intervals or tests of statistical significance can account for all of this uncertainty. No other 
method for measuring causal effects has this property. One cannot, however, account for all un- 
certainty about generalizing an impact estimate beyond a given sample (its external validity) 
without both randomly sampling subjects from a known population and randomly assigning 
them to experimental groups (which is rarely possible in social research). 5 



A Simple Experimental Estimator of Causal Effects 

Consider an experiment where half of the sample is randomized to a treatment group 
that is offered an intervention and half is randomized to a control group that is not offered the 
intervention, and everyone adheres to their assigned treatment. Follow-up data are obtained for 
all sample members and the treatment effect is estimated by the difference in mean outcomes 
for the two groups, $\bar{Y}_T - \bar{Y}_C$. This difference provides an unbiased estimate of the average 
treatment effect (ATE) for the study sample, because the mean outcome for control group mem- 
bers is an unbiased estimate of what the mean outcome would have been for treatment group 
members had they not been offered the treatment (their counterfactual). 
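
A minimal sketch of this estimator follows, assuming a hypothetical study sample in which every subject's outcome would rise by the same amount if treated. The function names, the simulated outcomes, and the effect of 2.0 are illustrative assumptions. Repeating the lottery many times for the same sample shows that the difference in means centers on the true effect.

```python
import random
import statistics

def difference_in_means(outcomes, assigned_to_treatment):
    """Treatment group mean outcome minus control group mean outcome."""
    treated = [y for y, t in zip(outcomes, assigned_to_treatment) if t]
    control = [y for y, t in zip(outcomes, assigned_to_treatment) if not t]
    return statistics.fmean(treated) - statistics.fmean(control)

def average_estimate(n=100, true_effect=2.0, reps=2000, seed=0):
    """Mean impact estimate over many re-randomizations of one fixed sample."""
    rng = random.Random(seed)
    # Hypothetical data: each subject's outcome in the absence of treatment.
    y_without = [rng.gauss(10.0, 1.0) for _ in range(n)]
    assignment = [True] * (n // 2) + [False] * (n - n // 2)
    estimates = []
    for _ in range(reps):
        rng.shuffle(assignment)  # re-run the lottery
        observed = [y + true_effect if t else y
                    for y, t in zip(y_without, assignment)]
        estimates.append(difference_in_means(observed, assignment))
    return statistics.fmean(estimates)

if __name__ == "__main__":
    print(round(average_estimate(), 3))  # should be close to the assumed effect of 2.0
```

Individual estimates scatter around the true effect; only their average across randomizations matches it, which is what unbiasedness means here.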

However, any given sample can yield a treatment group and control group with 
preexisting differences that occur solely by chance, so the resulting impact estimate can 
overestimate or underestimate the ATE. The standard error of the impact estimator, 
$SE(\bar{Y}_T - \bar{Y}_C)$, accounts for this random error, where: 

5 Two major studies that used random sampling and random assignment are the national evaluations of Head 
Start (Puma et al., 2006) and the Job Corps (Schochet, 2006). 
$$SE(\bar{Y}_T - \bar{Y}_C) = \sqrt{\frac{\sigma^2}{n_T} + \frac{\sigma^2}{n_C}} \tag{1}$$

given: 

$n_T$ and $n_C$ = the number of treatment group members and control group members, 

$\sigma^2$ = the pooled outcome variance across subjects within experimental groups. 6 



The number of treatment group members and control group members are experimental design 
decisions. The variance of the outcome measure is an empirical parameter that must be “guess- 
timated” from previous research when planning an experiment and can be estimated from fol- 
low-up data when analyzing experimental findings. For the discussion that follows it is useful to 
restate Equation 1 as: 



$$SE(\bar{Y}_T - \bar{Y}_C) = \sqrt{\frac{\sigma^2}{nP(1-P)}} \tag{2}$$



where $n$ equals the total number of experimental sample members ($n_T + n_C$) and $P$ equals the 
proportion of this sample that is randomized to treatment. 7 
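
The two forms of the standard error can be checked numerically. Equation 1 is the square root of $\sigma^2/n_T + \sigma^2/n_C$; substituting $n_T = Pn$ and $n_C = (1-P)n$ gives the square root of $\sigma^2/(nP(1-P))$, which is Equation 2. The sketch below uses hypothetical function names and made-up inputs (a standard deviation of 10 and 400 sample members) purely for illustration.

```python
import math

def se_two_group(sigma, n_t, n_c):
    """Equation 1: standard error from the two group sizes."""
    return math.sqrt(sigma ** 2 / n_t + sigma ** 2 / n_c)

def se_by_allocation(sigma, n, p):
    """Equation 2: standard error from total sample size n and treatment share p."""
    return math.sqrt(sigma ** 2 / (n * p * (1 - p)))

if __name__ == "__main__":
    sigma, n = 10.0, 400  # illustrative values, not taken from the chapter
    for p in (0.5, 0.75, 0.9):
        n_t = round(p * n)
        n_c = n - n_t
        # Both columns should agree for every allocation.
        print(p, round(se_two_group(sigma, n_t, n_c), 4),
              round(se_by_allocation(sigma, n, p), 4))
```

Because $P(1-P)$ is largest at $P = 0.5$, a balanced allocation gives the smallest standard error for a fixed total sample when the two groups share a common variance.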



Choosing a Sample Size and Allocation 

The first steps in designing a randomized experiment are to specify its treatment, target 
group, and setting. The next steps are to choose a sample size and allocation that maximize pre- 
cision given existing constraints. For this purpose, it is useful to measure precision in terms of 
minimum detectable effects (Bloom, 1995 and 2005b). Intuitively, a minimum detectable effect 
is the smallest true treatment effect that a research design can detect with confidence. Formally, 
it is the smallest true treatment effect that has a specified level of statistical power for a particu- 
lar level of statistical significance, given a specific statistical test. 

Figure 1 illustrates that the minimum detectable effect of an impact estimator is a 
multiple of its standard error. The first bell-shaped curve (on the left of the figure) represents a t 
distribution for a null hypothesis of zero impact. For a positive impact estimate to be statistically 
significant at the $\alpha$ level with a one-tail test (or at the $\alpha/2$ level with a two-tail test), the 
estimate must fall to the right of the critical t-value, $t_\alpha$ (or $t_{\alpha/2}$), of the first distribution. The second 
bell-shaped curve represents a t distribution for an alternative hypothesis that the true impact 
equals a specific minimum detectable effect. To have a probability $(1 - \beta)$ of detecting the 
minimum detectable effect, it must lie a distance of $t_{1-\beta}$ to the right of the critical t-value for the 



6 The present discussion assumes a common outcome variance for the treatment and control groups. 

7 Note that $Pn$ equals $n_T$ and $(1-P)n$ equals $n_C$. 
null hypothesis. (The probability $(1 - \beta)$ represents the level of statistical power.) Hence the 
minimum detectable effect must lie a total distance of $t_\alpha + t_{1-\beta}$ (or $t_{\alpha/2} + t_{1-\beta}$) from the null 
hypothesis. Because t-values are multiples of the standard error of an impact estimator, the 
minimum detectable effect is either $t_\alpha + t_{1-\beta}$ (for a one-tail test) or $t_{\alpha/2} + t_{1-\beta}$ (for a two-tail test) times 
the standard error. These critical t-values depend on the number of degrees of freedom. 

A common convention for defining minimum detectable effects is to set statistical 
significance (α) at 0.05 and statistical power (1 - β) at 80 percent. When the number of degrees of 
freedom exceeds about 20, the multiplier equals roughly 2.5 for a one-tail test and 2.8 for a 
two-tail test. 8 Thus, if the standard error of an estimator of the average effect of a job-training 
program on future annual earnings were $500, the minimum detectable effect would be roughly 
$1,250 for a one-tail test and $1,400 for a two-tail test. 
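
The multipliers of roughly 2.5 and 2.8 can be reproduced approximately from the standard normal distribution, which the t distribution approaches as the degrees of freedom grow. The sketch below is a large-sample approximation under that assumption; the function name is hypothetical, and exact values for small samples would require critical values from the t distribution instead.

```python
from statistics import NormalDist

def mde_multiplier(alpha=0.05, power=0.80, two_tailed=True):
    """Large-sample multiplier: critical value for significance plus value for power.

    Uses standard normal quantiles in place of t-values, so it slightly
    understates the multiplier when degrees of freedom are small.
    """
    z = NormalDist()
    tail_area = alpha / 2 if two_tailed else alpha
    return z.inv_cdf(1 - tail_area) + z.inv_cdf(power)

if __name__ == "__main__":
    standard_error = 500  # dollars, as in the job-training illustration
    for two_tailed in (False, True):
        m = mde_multiplier(two_tailed=two_tailed)
        print("two-tail" if two_tailed else "one-tail",
              round(m, 2), round(m * standard_error))
```

The resulting minimum detectable effects differ by a few dollars from the rounded figures in the text because the text rounds the multipliers to 2.5 and 2.8 before multiplying.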

Consider how this applies to the experiment described above. The multiplier, $M_{n-2}$ 9, 
times the standard error, $SE(\bar{Y}_T - \bar{Y}_C)$, yields the minimum detectable effect: 

$$MDE(\bar{Y}_T - \bar{Y}_C) = M_{n-2} \sqrt{\frac{\sigma^2}{nP(1-P)}}$$

Since the multiplier $M_{n-2}$ is the sum of two t-values, determined by the chosen levels 
of statistical significance and power, the missing value that needs to be determined for the 
sample design is that for $\sigma^2$. This value will necessarily be a guess, but since it is a central 
determinant of the minimum detectable effect, it should be based on a careful search of empirical 
estimates for closely related studies. 10 
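
To show how these pieces combine in planning, the sketch below computes the minimum detectable effect as the multiplier times $\sigma / \sqrt{nP(1-P)}$ and inverts that expression to find the sample size needed for a target. The function names and the input values (an outcome standard deviation of $10,000 and a sample of 1,600) are hypothetical. The multiplier is held fixed at 2.8, which assumes a two-tail test with enough degrees of freedom for the large-sample value to apply.

```python
import math

def minimum_detectable_effect(sigma, n, p=0.5, multiplier=2.8):
    """Multiplier times the standard error sqrt(sigma^2 / (n * p * (1 - p)))."""
    return multiplier * sigma / math.sqrt(n * p * (1 - p))

def required_sample_size(target_mde, sigma, p=0.5, multiplier=2.8):
    """Total sample size n that yields the target minimum detectable effect.

    Returns a float; round up in practice. The multiplier is treated as
    fixed, ignoring its slight dependence on n through degrees of freedom.
    """
    return (multiplier * sigma / target_mde) ** 2 / (p * (1 - p))

if __name__ == "__main__":
    sigma = 10_000  # guessed outcome standard deviation in dollars (hypothetical)
    print(round(minimum_detectable_effect(sigma, n=1600)))           # balanced design
    print(round(minimum_detectable_effect(sigma, n=1600, p=0.8)))    # unbalanced design
    print(round(required_sample_size(target_mde=1400, sigma=sigma)))
```

Because the required sample size depends on the square of $\sigma$, a guess for the outcome standard deviation that is off by a given percentage shifts the required sample by roughly twice that percentage, which is why the variance estimate deserves a careful search of prior studies.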

Sometimes impacts are measured as a standardized mean difference or “effect size,” ei- 
ther because the original units of the outcome measures are not meaningful or because outcomes 
in different metrics must be combined or compared. (There is no reason to standardize the impact 
estimate for the preceding job training example.) The standardized mean difference effect size 
(ES) equals the difference in mean outcomes for the treatment group and control group, divided 
by the standard deviation of outcomes across subjects within experimental groups, or: 



8 When the number of degrees of freedom becomes smaller, the multiplier becomes larger as the t distribu- 
tion becomes fatter in its tails. 

9 The subscript n-2 equals the number of degrees of freedom for a treatment and control group difference of 
means, given a common variance for the two groups. 

10 When the outcome measure is a one/zero binary variable (e.g., employed = 1 or not employed = 0), the 
outcome variance is p(1-p), where p is the probability of a value equal to one, and the variance of the sample 
mean is p(1-p)/n. The usual conservative practice in this case is to choose p = .5, which yields the maximum 
possible variance. 